Red Wine Quality (a dataset on red wine quality)


2024-02-06 15:42 | Source: compiled from the web

Original text:

Red Wine Quality

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

For more information, read [Cortez et al., 2009].

 

Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)
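To make the variable list concrete, here is a minimal sketch of parsing such a file with Python's standard library and counting how many wines fall into each quality score, which is a quick way to see the class imbalance mentioned earlier. The header spellings, the comma delimiter, and the helper names `load_wine_rows` and `quality_counts` are assumptions for illustration; the sample rows are made up and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Column names follow the variable list above. The exact header spelling and
# the delimiter of a real download are assumptions; adjust them as needed.
COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]


def load_wine_rows(text, delimiter=","):
    """Parse CSV text into dicts: 11 float inputs plus an integer quality."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for record in reader:
        row = {name: float(record[name]) for name in COLUMNS[:-1]}
        row["quality"] = int(record["quality"])
        rows.append(row)
    return rows


def quality_counts(rows):
    """Count how many wines received each quality score."""
    return Counter(row["quality"] for row in rows)


# Made-up rows for illustration only (not real measurements).
SAMPLE = "\n".join([
    ",".join(COLUMNS),
    "7.0,0.50,0.10,2.0,0.080,15,40,0.9970,3.30,0.60,9.5,5",
    "7.8,0.60,0.05,2.2,0.090,20,55,0.9968,3.25,0.58,9.8,5",
    "8.1,0.40,0.30,1.8,0.070,12,30,0.9965,3.20,0.65,10.5,6",
    "7.5,0.30,0.40,2.1,0.065,10,25,0.9950,3.35,0.75,12.0,7",
])

rows = load_wine_rows(SAMPLE)
print(sorted(quality_counts(rows).items()))
```

With a real file, the same functions could be fed the file's text, passing a different `delimiter` if the download is not comma-separated.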

 

Tips

Aside from regression modelling, an interesting exercise is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. classifying wines that score 7 or higher as 'good/1' and the remainder as 'not good/0'.
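A minimal sketch of that cutoff in Python, assuming the quality scores are already available as a list of integers. The function name `binarize_quality` and the sample scores are made up for illustration.

```python
def binarize_quality(scores, cutoff=7):
    """Map each quality score to 1 ('good') if it is >= cutoff, else 0."""
    return [1 if score >= cutoff else 0 for score in scores]


scores = [5, 6, 7, 8, 4, 7]  # made-up scores, not from the dataset
labels = binarize_quality(scores)
print(labels)  # [0, 0, 1, 1, 0, 1]
```

Because the scores are whole numbers, a cutoff of "7 or higher" gives the same labels as a rule written as "greater than 6.5".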

This lets you practice hyperparameter tuning on, e.g., decision tree algorithms, using the ROC curve and the AUC value to compare settings.

Without any feature engineering or overfitting, you should be able to reach an AUC of 0.88 (without even using a random forest algorithm).
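For readers who have not met the AUC before, the sketch below computes it directly from its definition: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration; a library routine would normally be used on real data, and this pairwise version is only meant to show what the number measures.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 0/1 class labels; scores: predicted scores for class 1.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

When tuning a decision tree, this value would be computed on held-out data for each hyperparameter setting, and the settings compared on that basis.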

KNIME is a great tool (GUI) that can be used for this.

 

1 - File Reader (for CSV) to a Linear Correlation node and to an Interactive Histogram node for basic EDA.

2 - File Reader to a Rule Engine node to turn the 10-point scale into a dichotomous variable (good wine and the rest). The rule to put in the Rule Engine is something like this:

$quality$ > 6.5 => "good"

TRUE => "bad"

3 - Rule Engine node output to the input of a Column Filter node, to filter out your original 10-point feature (this prevents leakage).

4 - Column Filter node output to the input of a Partitioning node (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified').

5 - Partitioning node train-split output to the input of the Decision Tree Learner node.

6 - Partitioning node test-split output to the input of the Decision Tree Predictor node.

7 - Decision Tree Learner node output to the model input of the Decision Tree Predictor node.

8 - Decision Tree Predictor output to the input of the ROC node (here you can evaluate your model based on the AUC value).
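For readers without KNIME, steps 2 to 4 of the workflow above can be sketched in plain Python. This is only an approximation of what the nodes do, not KNIME's implementation: the function names, the `label` column name, and the toy rows are assumptions for illustration. Training the tree and scoring it (steps 5 to 8) are left to whichever library you prefer.

```python
import random


def apply_rule(row):
    """Step 2: mimic the rule `$quality$ > 6.5 => "good"`, else "bad"."""
    return "good" if row["quality"] > 6.5 else "bad"


def label_and_filter(rows):
    """Steps 2-3: add the class column, then drop the original score.

    Dropping `quality` keeps the model from reading the answer directly
    from the feature the label was derived from (leakage).
    """
    result = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "quality"}
        new_row["label"] = apply_rule(row)  # "label" is a hypothetical name
        result.append(new_row)
    return result


def stratified_split(rows, train_fraction=0.75, seed=0):
    """Step 4: split each class separately so both parts keep its share."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Made-up rows: 8 "good" wines (quality 7) and 32 others (quality 5).
rows = [
    {"alcohol": 9.0 + 0.1 * i, "quality": 7 if i % 5 == 0 else 5}
    for i in range(40)
]

prepared = label_and_filter(rows)
train, test = stratified_split(prepared)
print(len(train), len(test))  # 30 10
```

Splitting per class matters here because the 'good' class is small; a purely random 75%/25% split could leave the test set with very few good wines.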

 

